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Abstract. The adaptive Metropolis (AM) algorithm of Haario, Saksman and 
Tamminen [Bernoulli 7 (2001) 223-242] uses the estimated covariance of the tar- 
get distribution in the proposal distribution. This paper introduces a new robust 
adaptive Metropolis algorithm estimating the shape of the target distribution and 
simultaneously coercing the acceptance rate. The adaptation rule is computation- 
ally simple adding no extra cost compared with the AM algorithm. The adaptation 
strategy can be seen as a multidimensional extension of the previously proposed 
method adapting the scale of the proposal distribution in order to attain a given ac- 
ceptance rate. The empirical results show promising behaviour of the new algorithm 
in an example with Student target distribution having no finite second moment, 
where the AM covariance estimate is unstable. In the examples with finite second 
moments, the performance of the new approach seems to be competitive with the 
AM algorithm combined with scale adaptation. 

1. Introduction 

Markov chain Monte Carlo (MCMC) is a general method to approximate integrals 
of the form 

I := / f(x)ir(x)dx < oo 

where ir is a probability density function, which can be evaluated point-wise up to a 
normalising constant. Such an inte gral occurs frequently when computing Bayesian 



posterior expectations [e.g., Il2|, [20|, |22j. The MCMC method is based on a Markov 
chain {X n ) n >\ that is easy to simulate in practice, and for which the ergodic averages 
I n := n~ l Y^k=i f{Xk) converge to the integral / as the number of samples n tends 
to infinity. 

One of the most generally applicable MCMC method is the random walk Metropolis 
(RWM) algorithm. Suppose q is a symmetric probability density supported on M d (for 
example the standard Gaussian density) and let S G M. dxd be a non-singular matrix. 
Set Xi = Xi, where x\ G M. d is a given starting point in the support; ir(xi) > 0. For 
n > 2 apply recursively the following two steps: 
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(Ml) simulate Y n = X n _i + SU n , where U n ~ q is a independent random vector, and 

(M2) with probability a n := a(X n _i,F n ) := min{l, 7r(y n )/7r(X n _i)} the proposal is 
accepted, and X n = Y n ; otherwise the proposal is rejected and X n = X n _\. 

This algorithm will produce a valid chain, that is, I n — y I almost surely as n — > oo 



[e.g. [19|, Theorem 1]. However, the efficiency of the method, that is, the speed of the 
convergence I n — > I, is crucially affected by the choice of the shape matrix S. 

Recently, there has been an increasing interest on adaptive MCMC algorithms 
that try to learn some properties of the target distribution tt on-the-ffy, and use this 
information to facilitate more efficient sampling 0, 0, 0, E H |24| ; see also the recent 



review by |3[. In the context of the RWM algorithm, this is typically implemented 
by replacing the constant shape S in (MTJ with a random matrix 5 n _i that depends 
on the past (on the random variables Uk, Xf;, and for 1 < k < n — 1). 

Different strategies have been proposed to compute the matrix S n -x. The sem- 
inal Adaptive Metropolis (AM) algorithm [l3| uses S n -i = 9L n -\ where L n -i is 
the Cholesky factor of the (possibly modified) empirical covariance matrix C n _i = 
Cov(Ai, . . . , X n -i). Under certain assumptions, the empirical covariance converges 
to the true covariance of the target distribution tt [see, e.g., S, H H 0. The 



constant scaling parameter 9 > is a tuning parameter chosen by the user; the value 
9 = 2A/\fd proposed in the original pa per is widely used, as it is asymptotically 
optimal under certain theoretical setting [11} . 

In fact, the theory behind the value 9 = 2A/y/d connects the mean acceptance rate 
to the efficiency of the Metropolis algorithm in more general settings. Therefore, it 
is sensible to try to find such a scaling factor 9 that yields a desired mean acceptance 



rate; typically 23.4% in multidimensional settings 25]. The first algorithms coercing 
the acceptance rate did not adapt the shape factor at all, but only the scale of the 
proposal distribution. That is, S n -\ = 9 n -\I, a multiple of a constant matrix, where 
the factor n _i G (0, oo) is adapted roughly by increasing the value of the acceptance 
probability is too low, and vice versa |3_|, [g, LTj, |24j . This adaptive scaling Metropolis 
(ASM) algorithm has some nice properties, and it has been shown that the algorithm 
is stable under quite a general setting j28[. It is, however, a 'one-dimensional' scheme, 
in the sense that it is unable to adapt to the shape of the target distribution like the 
AM algorithm. This can result in slow mixing with certain target distributions tt 
having a strong correlation structure. 

The scale adaptation in the ASM approach has been proposed to be used within 
the AM algorithm 0, 0]. This algorithm, which shall be referred here to as the 
adaptive scaling within AM (ASWAM), combines the shape adaptation of AM and 
the acceptance probability optimisation. Namely, S n -\ = 9 n _ 1 L n _ 1 , where 9 n -\ is 
computed from the observed acceptance probabilities a 2 , ■ ■ ■ , a n -i and L n _i is the 
Cholesky factor of Cov(Ai, . . . , X n -i). This multi-criteria adaptation framework pro- 
vides a coerced acceptance probability, and at the same time captures the covariance 
shape information of tt. Empirical findings indicate this algorithm can overcome some 



difficulties encountered with the AM method 13 
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The present paper introduces a new algorithm alternative to the ASWAM ap- 
proach. The aim is to seek a matrix factor S 1 * that captures the shape of it and at the 
same time allows to attain a given mean acceptance rate. Unlike the multi-criteria 
adaptation in ASWAM, the new approach is based on a single matrix update formula 
that is computationally equivalent to the covariance factor update in AM. The algo- 
rithm, called here the robust adaptive Metropolis (RAM), differs from the ASWAM 
approach by avoiding the use of the empirical covariance, which can be problematic 
in some settings, especially if ir has no finite second moment. The proposed approach 
is reminiscent, yet not equivalent, with robust pseudo-covariance estimation, which 
has also been proposed to be used in place of the AM approach Q]. 

The RAM algorithm is described in detail in the next section. Section [3] provides 
analysis on the stable points of the adaptation rule, that is, where the sequence of 
matrices S n is supposed to converge. In Section HJ the validity of the algorithm 
is verified under certain sufficient conditions. It is also shown that the adaptation 
converges to a shape of an elliptically symmetric target distribution. The RAM 
algorithm was empirically tested in some example settings and compared with the 
AM and the ASWAM approaches. Section [5] summarises the encouraging findings. 
The final section concludes with some discussion on the approach as well as directions 
of further research. 



In what follows, suppose that the proposal density q is spherically symmetric: 
there exists a function q : R — > [0, oo) such that q(x) = q(\\x\\) for all x G M. d . Let 
si G M. dxd be a lower- diagonal matrix with positive diagonal elements, and suppose 
{Vn}n>i C (0, 1] is a step size sequence decaying to zero. Furthermore, let x\ G M. d 
be some point in the support of the target distribution, tc(xi) > 0, and let a* G (0, 1) 
stand for the target mean acceptance probability of the algorithm. 

The robust adaptive Metropolis process is defined recursively through 

(Rl) compute Y n := A n _i + S n -iU n , where U n ~ q is an independent random vector, 
(R2) with probability a n := min{l, 7r(F n )/7r(X n _i)} the proposal is accepted, and 

X n := Y n ; otherwise the proposal is rejected and X n := A n _i, and 
(R3) compute the lower-diagonal matrix S n with positive diagonal elements satisfying 

the equation 



where I G M dxd stands for the identity matrix. 

The steps (RQ]) and (RT5D implement one iteration of the RWM algorithm, but with 
a random matrix S n -\ in (RQ]). In the adaptation step (RSI) the unique S n satisfying 
([1]) always exists, since it is the Cholesky factor of the matrix in the right hand side, 
which is verified below to be symmetric and positive definite. 



2. Algorithm 



(1) 
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Figure 1. Two examples of the RAM update (RJHj). The solid line 
represents the contour ellipsoid defined by 5 n _iS'J_ 1 , and the vector 
S n -iU n /\\U n \\ is drawn as a dot. The contours defined by S n S^ are 
dashed. 



Proposition 1. Suppose S G M. dxd is a non-singular matrix, u G lR d is a non-zero 
vector and a G (— l,oo) is a scalar. Then, the matrix M := S(l + a^^jS 7 is 
symmetric and positive definite. 

Proof. The symmetricity is obvious. Let x G M. d \ {0}, denote u '■— jr^r and define 
z ■— Su. We may write M = SS T + azz T , whence 

x T Mx = ||x T 5|| 2 + a(x T z) 2 = \\x T S\\ 2 ( 1 - ~ 



This already establishes the claim in the case a > 0. Suppose then a G (—1,0). 
Clearly (x T ^) 2 = \\x T Su\\ 2 < \\x T S\\ 2 and so x T Mx > \\x T S\\ 2 (l - |a|) > 0. 

□ 

Let us then see what happens in the adaptation in intuitive terms. Observe first 
that in (HI]) the proposal Y n is formed by adding an increment W n '■— S n -\U n to the 
previous point X n _x. Since U n is distributed according to the spherically symmetric q, 
the random variable W n is distributed according to the elliptically symmetric density 
QSn~i( w ) := det(5 n _i) _1 g(S' n ^ 1 w) with the main axes defined by the eigenvectors and 
the corresponding eigenvalues of the matrix SVt-iS^-i- 

To illustrate the behaviour of the RAM update (R[3]), Figure [1] shows two examples 
how the contours of the proposal change in the update. The example on the left 
shows how the contour ellipsoid expands to the direction of S n U n when r\ n {a n — a*) = 
0.8 > 0. Similarly, the example on the right shows how the ellipsoid shrinks when 
Vn{c*n — ct*) — —0.8 < 0. These examples reflect the basic idea behind the approach. 
If the acceptance probability is smaller than desired, a n < a* (or more than desired, 
a n > a*) the proposal distribution is shrunk (or expanded) with respect to the 
direction of the current proposal increment. 

We can also see this behaviour from the update equation by considering the radius 
of the contour ellipsoid defined by S n S^ with respect to different directions. Let 
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v G M. d be a unit vector. As in the proof of Proposition HJ we may write 

ll^ll 2 = \\ s n-i v \\ 2 + Vn(a n - a^{Z T n vf 

where Z n = S n U n /\\U n \\. If Z n and v are orthogonal, the latter term vanishes and 
II^HI = ll^n-i 1 '!!- K they are parallel, that is, v = ±Z n /\\Z n \\, then the factor (Z^v) 2 
equals ||<S^_if || 2 , and so \\S^v \\ = + rj n (a n — a*)||S^ , _ 1 'y||. Any other choices of 
the unit vector v fall in between these two extremes. 

Remark 2. In dimension one, the value of S n can be computed directly by 

log S n = log 5 n _i + - log (l + r] n (a n - a*)) . 

When f] n is small, this is almost equivalent to the update 

log S n = log Sft-i + y(a„ - a*) 

implying that the RAM algorithm will exhibit a similar behaviour with the ASM 
algorithm as proposed by p and (3[ and analysed by (28[. Therefore, it is justified 
to consider RAM as a multidimensional generalisation of the ASM adaptation rule. 

Remark 3. In practice, the matrix S n in (R[3]) can be computed as a rank one Cholesky 
update or downdate of S n -i when a n — a* > and a n — a* < 0, respectively [lOj. 
Therefore, the algorithm is computationally efficient up to a relatively high dimension. 
In fact, the full d- dimensional matrix multiplication required when generating the 
proposal in (RCQ has the same 0(d 2 ) complexity as the Cholesky update or downdate, 
rendering the adaptation to only add a constant factor to the complexity of the RWM 
algorithm. 

Remark 4. While the step size sequence t] n can be chosen quite freely, in practice it 
is often defined as r] n = rT 1 with an exponent 7 G (1/2, Ik The choice 7 = 1, which 



is employed in the original setting of the AM algorithm [13j is not advisable for the 
RAM algorithm. For simplicity, consider a one- dimensional setting like in Remark 
[2j Then, if r\ n = rT 1 the logarithm of S n can increase or decrease only at the speed 
± 5^fc=i Vk ~ log(n). Therefore, S n can grow or shrink only linearly or at the speed 
1/n, respectively. This renders the adaptation inefficient, if the initial value s\ differs 
significantly from the the scale and shape of it. 



3. Stable points 

The RAM algorithm introduced in the previous section has, under suitable condi- 
tions, a stable point, that is, a matrix S* G M. dxd , where the adaptation process S n 
should converge as n increases. Before considering the convergence, we shall study 
the stable points of the algorithm in certain settings. 

One can write the update equation ([T]) in the following form 

(2) SnSn — S'n-lS'n-l + VnH(S n -i, X n -l, U n ) 



(> 
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where 

hyc ^ cf ■ fi n(x + Su) \ \ uu T T 
H(b,x,u) — b mm 1, r~ — > — a* } jr—r^b . 

The recursion ([2]) implements a so called Robbins-Monro stochastic approximation 
algorithm on (S n S^) n >i [e.g. H, 0, Such an algorithm seeks the root of the so 
called mean field h n defined as 

h n (S) := S [ [ ( min il, — - — ^— — -1 — a* ) v. — 7^q(u)duiT(x)dxS T . 

Jri Jri V I ^(x) J y \\u\\ 2 

We shall see that under some sufficient conditions, there exists a stable point, that 
is, h n (S) = 0. 

First, we shall observe a fundamental property of the RAM algorithm; that it is 
invariant under affine transformations. 

Theorem 5. Let i\ be a probability density and let (X n , S n ) n >i be the RAM process 
(R\^\j-(R\^j targeting tt and started from (x±, si). Suppose A 6 M. dxd is a non-singular 
matrix, b G M. d and define if(x) := | det(y4)| _1 7r(A _1 a; — b). Let (X n , S n ) n >i be the 
RAM process targeting n and started from (Axi + b, Asi). Then, the processes (AX n + 
b, (AS n )(AS n ) T ) n >i and (X n , S n S^) n >i have identical distributions. 

Proof. Let U n ~ q and W n ~ f7 (0, 1) be the independent sequences that drive the 
RAM process {X n: >S n )n>i targeting 7r; that is 

(3) Y n = X n _i + S n ^iU n 

(4) X n = Y n l {Wn < an} + X n l {Wn>an} . 

The proof proceeds by constructing an independent sequence U n ~ q, so that the 
RAM process (X n , S n ) n >i targeting n and driven by (£/„) n >i and (W n ) n >i will satisfy 
the claim path-wise: AX n = X n and AS n (AS n ) T = S n S^ for all n > 1. 

Write the QR decomposition (AS n ) T = Q n R n where Q n is orthogonal and where 
S n := R^ is lower- diagonal and chosen so that it has a positive diagonal. We 
observe that AS n (AS n ) T = S n S^ and defining U n +i '■= Q^U n+ i we have also 
AS n U n+ i = S n U n+ i. Since the distribution of U n+ i is spherically symmetric and 
U n+ i is independent of Q n , the sequence (U n ) n >i is i.i.d. with distribution q. 

Now, we may verify inductively using fl3]) and @ that X n = AX n can be computed 
through 

Y n X n _\ -\- S n —\U n 

X n = Y n ^{W n <&„} + ^n-l^{W n >& n } 

where 

a n = mm <^ 1, — } = mm <^ 1, — — > = a n . □ 

I tt(AVi)J I 7r(X n _!)J 

After Theorem [51 it is no surprise that the mean field of the algorithm satisfies 
similar invariance properties. 
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Theorem 6. Suppose tt is a probability density. 

(i) Let ft be an affine transformation of it, that is, n(x) = \ det(A)\~ 1 ir(A~ 1 x — b) for 
some non-singular matrix A G M. dxd and b G M. d . Then, Ah 7T (S)A T = h^(AS) 
for all S eR dxd . 

(ii) For any orthogonal matrix Q G M. dxd and for all S G M. dxd , h n (S) = h n (SQ). 
(Hi) Suppose that S is a unique lower- diagonal matrix with positive diagonal satisfy- 
ing h n (S) = 0. Then, restricted to such matrices, the solution of h#(S) = is 
also unique, and of the form S = ASQ for some orthogonal Q G R dxd . 

Proof. The claim (0) follows by a change of variable x = A~ l z — b, 

h n (S) — S [ [ (minil, — — y— — - \ — a* ) 7r(x)dxj. — rprq(u)duS T 



7T(X) J J \\U 



S I I [ min \ 1, ~t^ U ^ | — a^j n(z)dz ^^ q^duS 1 



A- l h*{AS)A 



-T 



The claim ^ follows from similarly, by a change of variable u = Qv and due to the 
spherical symmetry of q. The uniqueness up to rotations, that is, only the matrices 
of the form S = ASQ satisfy h%(S) = follows directly as above. The claim (lm|) is 
completed by writing the QR-decomposition (AS) T = QR. and by observing that 
the upper-triangular R can be chosen to have positive diagonal elements. □ 

Theorem [6] verifies that the stable points of the algorithm are affinely invariant like 



the covariance (or more generally robust pseudo-covariance) matrices [15j. Theorem [7] 
below verifies that in the case of a suitable elliptically symmetric target distribution it, 
the stable points of the RAM algorithm in fact coincide with the (pseudo-) covariance 
of it. This is an interesting connection, but in general the fixed points of the RAM 
algorithm are not expected to coincide with the pseudo-covariance. 

Theorem 7. Assume a* G (0, 1) and it is elliptically symmetric, that is, tt[x) = 
det(S) _1 p(||S^ 1 a;||) for somep : [0, oo) — > [0, oo) and for some symmetric and positive 
definite £ G M. dxd . Then, 

(i) there exists a lower- diagonal matrix with positive diagonal 5* G M. dxd such that 
h n (S*) = and such that S*Sj is proportional to S 2 . 

(ii) assuming the function p is non-increasing, the solution S 1 * is additionally unique. 

Proof. In light of Theorem [HI it is sufficient to consider any spherically symmetric it, 
that is, the case £ is an identity matrix. 

Let S be a lower-diagonal matrix with positive diagonal. Observe that since S is 
non-singular, h n (S) = is equivalent to S' _1 /i vr (S')S , ~ T = 0, that is 



(5) / / (min { 1, — "*~ f ^ i - aA ^~-rq(u)duTr(x)dx = 0. 

Jr" JRd \ I 7T{X) J J WuW 1 
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Define the function 

h(S):= [ [ (min il, — — — -\ J j — —q(u)duTx(x)dx. 

W V I ' n{x) J / IMI 

It is easy to see by symmetry and taking traces that ([5]) is equivalent to h(S) = 
where / G M. dxd stands for the identity matrix. 

We can write h(S) in a more convenient form by using the polar coordinate rep- 
resentation u = rv, where v G S d := {v G M d : ||u|| = 1} is a unit vector in the unit 
sphere, and r = ||u|| is the length of u. Then, by Fubini's theorem 



h(S) 



min {ti(x),tt(x + rSv )} dxq(r)dr 



v f T /i(df ) 



where \i stands for the uniform distribution on the unit sphere S d and the proposal 
is written as q(u) oc g(||«||). 

By applying the representation of tt by the radial function p one can write the term 
above in brackets as 

poo P 

g{^Sv^) := I I min {p(||a;||),p(||a; + r5v||)} dxq(r)dr, 
Jo JR d 

since due to symmetry, the value of the integral depends only on the norm \\Sv \\. 
For any 9 G IR+, one can now write 

9(0) r 



h(6I) = / g{9)vv 1 ^{dv) 



d 



since trace (h(9I)) = g{6) and by symmetry Proposition [201 in Appendix \K\ shows 
that g : (0, oo) — > (0, oo) is continuous, that lim^oo g{6) = and that \im e ^ 0+ g{6) = 
J °° q{r)dr = 1. Therefore, there exists a 9* > such that g(9*) = a* so that 
h(9J) = fl, establishing (0). 

For (Jn]), let us first show that g is in this case strictly decreasing, at least before 
hitting zero. Observe that since p is non- increasing, one can write 



d(0) = I \ / p(||x||)dx+ / p(\\x + r9v\\)dx \q(r)dr 

' \\x\\>\\x+r6v\\ J \\x\\<\\x+r9v\\ J 

Ti{x)dx J q(r)dr. 

v f 

It is easy to see that the width of the strip A r g v := {\\x\\ < \\x + r9v\\} fl {||x|| < 
||x — r^w||} is increasing with respect to 9. Therefore, for any fixed r and v, the 
term b TV {9) := 1 — J A 7r(x)dx is strictly decreasing with respect to 9 as long as the 
support of ii is not completely covered by A r9v , in which case b rv {9) = 0. This implies 
that g{9) is strictly decreasing with respect to 9, until possibly g{9) = 0. Therefore, 
there is a unique 9* > for which gifi*) = a*. 

Let us assume that 5* G M. dxd is a matrix satisfying h(S) = %J. By symmetry, 
we can assume S to be diagonal, with positive diagonal elements si,...,sa > 0. 
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Let ei,...,ed stand for the standard basis vectors of M. d . The diagonal element 
[h(S)]u = 2f is equivalent to 

/ [g(\\Sv\\)-a*](v T e i ) 2 f i(dv) = 0, 
Js d 

since J iSd (f T e i ) 2 /i(df ) = d~ l . Denoting p(||5?;||) := ^(|| ( S'v||) — a*, this implies 

(6) £^((Eti^ 2 ) 1/2 )(Sti^ 2 )M^) = o 

for any choice of the constants \ G 1L Particularly, choosing \ = 1 for i = 1, . . . , d 
implies that for any constant cGlwe have 

(7) Jj((ELsW) 1/2 )cKdv)=0. 



Now, summing ([6j) and (EJ) with a specific choice of constants c = Q\ and A, = — s 2 , 
we obtain 



But now, ^((X^i=i sf' 1 '?) 1 ^ 2 ) > exactly when ^i=i s f t 'i — so the integrand 
is always non- negative. Moreover, if any Sj 7^ 0*, then by continuity there is a 
neighbourhood C/j C S d of such that the integrand is strictly positive, implying 
that the integral is strictly positive. This concludes the proof of the uniqueness 
©. □ 

The following theorem shows that when ir is the joint density of d independent 
and identically distributed random variables, the RAM algorithm has, as expected, 
a stable point proportional to the identity matrix. 

Theorem 8. Assume a* G (0,1) and ir(x) = Yli=iP( x i) f or som ^ one- dimensional 
density p. Then, there exists a 9 > such that h(9I) = 0. 

Proof. Let e\, . . . , stand for the coordinate vectors of M. d . Consider the functions 



Oi{9) :- 



/ / ( / min {ir(x), n(x + r9u)} dx ] q(r)dr(u T ei) 2/ H d 1 (du). 

JS d Jo \JR d J 



Let P be a permutation matrix. It is easy to see that n(x + r9u) = 7r(P(x + r9u)) 
by the i.i.d. product form of tt. Therefore, by the change of variable Px = z and 
Pu = v, one obtains that 



a-i(9) = / / I / min {tt(z), tt(z + r9v )} dx 
Js d Jo \JR d / 

x q{r)dr{v T P T e i ) 2 U d ' 1 {dv) = aj (0) 

by a suitable choice of P. Moreover, lim^oo a,i{9) = and lime^o+ Q*i{0) = c := 
f S d(u T e i ) 2 'H d ~ 1 (du) and are continuous. Therefore, there exists a 9* > such that 
CLi{9*) = a*c, and so efh(6*I)ei = 0. 



10 



MATTI VIHOLA 



It remains to show that eih(6*I)ej = for all i ^ j. But for this, it is enough to 
show that the integrals of the form 



min {tt(z), tt(z + r9v)} dx ) q(r)dr\(v 1 Ci){v 1 Cj)\'H d l (dv 



have the same value for both Ef, '■= {v G S d : {v T ei){v T ej) > 0} and E~- := {v G 
S d : {y T ei){v T ej) < 0}. But this is obtained due to the symmetry of the sets E^- and 
E^j and the product form of tt, since 

/ min {n(z), tt(z + r9v)} dx = / min {ft{z — ^rdv), 7r(z + \r9v) } dx 

JR d jR d 

so one can change the sign of any coordinate of v without affecting this integral. This 
concludes the claim. □ 

Remark 9. Checking the existence and uniqueness in a more general setting it is out 
of the scope of this paper. It is believed that there always exists at least one solution 
5* G M. dxd such that h(S*) = 0. Notice, however, that the fixed point may not be 
always unique; see an example of such a situation for one-dimensional adaptation 
(the ASM algorithm) in [13, Section 4.4]. 

Remark 10. It is not very difficult to show that for any given target ir and proposal 
q, there exist some constants < Q\ < 9 2 < oo such that the matrices h n (9il) and 
^tt(#2-0 are positive definite and negative definite, respectively. This indicates that, 
on average, S n should shrink whenever it is 'too big' and expand whenever it is 'too 
small,' so the algorithm should admit a stable behaviour. The empirical results in 
Section [5] support the hypothesis of general stability. 

To be more precise, we can identify a Lyapunov function w n for h n in the case 7r is 
elliptically symmetric with a non-increasing tail. This will allow us to establish the 
convergence of the sequence (S n S^) n >i in Theorem [T51 

Theorem 11. Assume the conditions of Theorem^ (ii) and denote i?* := S*Sj . 
Define a function w n : M dxd — > [0, oo) by 

detR 



w n (R) := trace (R # 1 i?) — log ( 



det .R* 



-d. 



Then, for any non-singular S G M. dxd it holds that (Vw 7T (SS T ), h^S)) < with 
equality only if SS T = R* . 

Proof. Denote fr(x) : = det( J R*) 1 / 2 7r(i?y 2 x), then by Theorem E © K(S) = 
R^^h^R* 1 ^ 2 S)R l J 2 . Moreover, Theorem [7] (ii) together with Theorem E] (EHJ) im- 
ply that 7T is spherically symmetric and S = I is the unique solution of h^(S) = 
(up to orthogonal transformations). 
We can write 

Vw^SiR^Sf) = R-^\I - (SS T )- l )R^ 2 = R^ 2 Vw,(S)R^ 2 , 
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so we obtain 

(Vw w (RV 2 S(Ry 2 S) T ),K(Rl/ 2 S) = trace [Vw^S{R^S) T fK{R^)\ 

= (Vw*(S),K(S)). 

Therefore, it is sufficient to check that the claim holds for spherically symmetric fc 
with R* — I. 

Let S be non-singular and write the singular value decomposition S = U SV T where 
U and V are orthogonal and 5* = diag(si, . . . , Sd) with positive diagonal entries. By 
Theorem |6] OH} we have ha(S) = h^{SV) = h^iJJS). We may write, using the notation 
in Theorem 

trace (K(S)) = trace (U T h(US)U) = [ g(\\Sw\\) [eIi^I M<H- 

Js d L J 

We have SS T = US 2 U T , so we obtain similarly 

trace ((SS T )- 1 h*(S)) = trace (S^U* h*{SV)U S' 1 ) = [ g(\\Sw\\) fi(dw). 

Js d 

Putting everything together, 



fi(dw). 



As in the proof of Theorem [7J ^(EiLi' 5 ?' 1 ^ 2 ) 1 ^ 2 ) > exactly when Yli=i 

sjw 2 < 1 

and vice versa. The integral can equal zero only if all Si = 1. □ 

4. Validity 

This section describes some sufficient conditions under which the RAM algorithm 
is valid; that is, when the empirical averages converge to the integral 

1 - r 

(8) I n = - J2 /(**) ^> / f(x)ir(x)dx =: I 

almost surely. 

Let us start by introducing assumptions on the forms of the proposal density q and 
the target density n. 

Assumption 12. The proposal density q is either a Gaussian or a Student distribu- 
tion, that is, 

/ \ — i|l?ll 2 / \ ii n 2-, d+p 

q{z) oc e 2 11 11 or q(z) oc (1 + ||;z|| ) 2 
for some constant p > 0. 

Assumption 13. The target density it satisfies either of the following assumptions. 

(i) The density tt is bounded and supported on a bounded set: there exists a 
constant m < oo such that tt(x) = for all ||a;|| > m. 
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(ii) The density tt is positive everywhere in M. d and continuously differentiable. The 
tails of tt are super-exponentially decaying and have regular contours, that is, 



respectively 



x 

lim - — f- • V log 7r(x) = — oo and 



>oo \\X\ 



.. x Vtt(x) 

limsup j, — tt ■ .. . ... < U. 

\\x\\-yoo \\ x \\ l|V7r(x)|| 

Remark 14. Assumption [131 ensures the geometric ergodicity of the RWM algorithm 
under fairly general settings; [l6[ discuss the limitations of (Jn]) and give several ex- 
amples. 

Before stating the theorem, consider the following conditions on the adaptation 
step size sequence (f] n )n>i and on the stability of the process (S n ) n >i. 

Assumption 15. The adaptation step sizes rj n G [0, 1] are non-increasing and satisfy 

E^=l k ~ lr ln < OO. 

Assumption 16. There exist random variables < A < B < oo such that all 
the eigenvalues An of the random matrices S n S^ are almost surely bounded by 
A < Ai l) < B, for all n = 1, 2, . . . and alH = 1, . . . , d. 

Theorem 17. Suppose Assumptions [HMTSi hold and denote Qq := {A > 0, B < oo}. 
Suppose also that the function f : M. d — > M satisfies for some p G [0, 1) 

sup |/(a;)|7r~ p (a;) < oo. 

x€M. d nr(x)>0 

Then, for almost every u G flo, the strong law of large numbers (|8]) holds. 

The proof follows by existing results in the literature; the details are given in 
Appendix IBl 

The convergence of the adaptation can also be established in case tt is elliptically 
symmetric. 



Theorem 18. // the conditions of Theorem^ (TJ|j and Theorem 11 hold and addi- 
tionally Ylin^n = oo, then S n S^ —> S*S^ for almost every co G fi . 

The proof follows by Theorem [11] and results in the literature; see Appendix Q31 

Remark 19. Assumptions [l~2HT5l are common when verifying the ergodicity of an 
adaptive MCMC algorithm. Assumption [16] on stability is natural but it can be 
difficult to check with F(A > 0, B < oo) = 1 in practice. The empirical evidence 
supports this hypothesis under a very general setting; see also Remark [TU] in Section 
[3] Similar stability results have been established only for few adaptive MCMC algo- 
rithms, including the AM and the ASM algorithms 26, 28|, 29]. The precise stability 



analysis is beyond the scope of this paper. Instead, the stability can be enforced as 
described below. 

Let 0<a<6<oobe some constants so that the eigenvalues of s\sj are within 
[a, b}. Then, replace the step (R[3]) in the RAM algorithm with the following: 
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(R3') compute the lower- diagonal matrix S n with positive diagonal so that S n S^ 
equals the right hand side of (pQ). If the eigenvalues of S n S^ are within [a, b], 
then set S n = S n , otherwise set S n = S^-i. 

While this modification ensures stability, it may change the stable points of the 
algorithm and the conclusion of Theorem [18] may not hold. This could possibly be 
avoided, for example, by considering an adaptive reprojections approach [3, HI, but 
we do not pursue this here. 



5. Experiments 

The RAM algorithm was tested with three types of target distributions: heavy- 
tailed Student, Gaussian and a mixture of Gaussians. The performance of RAM 
was compared against the seminal adaptive Metropolis (AM) algorithm [l3| and an 
adaptive scaling within adaptive Metropolis ( ASWAM) algorithms jsl, 0| . Especially 
the comparison against ASWAM is of interest, since it attains a given acceptance 
rate like the RAM algorithm. 

There are several parameters that are fixed throughout the experiments. The 
adaptation step size sequence was set to r\ n = n~ 2 ^ 3 for the AM and the ASWAM 
algorithms. For the RAM approach, the weight sequence was modified slightly so 
that r] n = min{l, d ■ n~ 2 / 3 }. The extra factor was added to compensate the expected 
growth or shrinkage of the eigenvalues being of the order d" 1 ; see the proof of Theorem 
[7J The target mean acceptance rate was a* = 0.234. In all the experiments, the 
Student proposal distribution of the form q(z) = (1 + IM| 2 )~~ was used. Such a 
heavy-tailed proposal was employed in order to have good convergence properties in 
case of heavy-tailed target densities [T" 



All the tests were performed using the publicly available Grapham software [27 



the latest version of the software includes an implementation of the RAM algorithm. 

5.1. Multivariate Student distribution. The first example is a bivariate Student 
distribution with n = 1 degrees of freedom and the following location and pseudo- 
covariance matrix 



and 



0.2 0.1 
0.1 0.8 



respectively. That is, the target density n(x) oc (1 + x T Y l ~ 1 x)~ 3 / 2 . Clearly, tt has 
no second moments and thereby the empirical covariance estimate used by AM and 
ASWAM is deemed to be unstable in this example. 

Figure |2] shows the results for one hundred runs of the algorithms. The grey area 
indicates the interval between the 10% and the 90% percentiles, and the black line 
shows the median. The top row shows the logarithm of the first diagonal element 
of the matrix S n . The AM covariance grows without an upper bound as expected. 
When the scale adaptation is added, the ASWAM approach manages to keep the 
factor S n = 8 n L n within certain bounds, but there is a considerable variation that 
does not seem to vanish. This is due to the fact that L n , the Cholesky factor of 
Cov(Xi, . . . , X n ), grows without an upper bound but at the same time the scaling 
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AM 



ASWAM 



RAM 



3 
2 






200k 400k 



200k 400k 



200k 400k 




200k 



400k 



200k 400k 



200k 



400k 



Figure 2. Bivariate Student example: logarithm of the first diagonal 
component of the matrix S n (top) and the proportion of X n in the set 
A after 100,000 burn-in iterations (bottom). 



factor 6 n decays to keep the acceptance rate around the desired 23.4%. The RAM 
algorithm seems to converge nicely to a limiting value. 

Such undesided behaviour of the AM and the ASWAM algorithms may also have 
an effect on the validity of their simulation. Indeed, let us consider the 90% highest 
probability density (HPD) set of the target, that is, the set A := {x G M 2 : (x — 
[i) T T l ~ 1 (x — fi) T > 99}. Figure [2] (bottom) shows the percentage of X n outside the 
90% HPD computed after a 100,000 sample burn-in period. The AM algorithm tends 
to overestimate the ratio slightly, with more variation than the ASWAM and the RAM 
approaches. The estimate produced by the ASWAM algorithm has approximately 
the same variation as RAM, but there is a tendency to underestimate the ratio. The 
RAM estimates are centred around the true value. 

To check how the RAM algorithm copes with higher dimensions, let us follow 



24j and consider a matrix £ = MM T , where M e ~R dxd is randomly generated 
with i.i.d. standard Gaussian elements. Such a matrix £ is used as the pseudo- 
covariance of a Student distribution, so that tt(x) oc (1 + x T Yl~ 1 x)~^r . [21] showed 



that in the case of Gaussian target and proposal distributions, one can measure 
the 'suboptimality' by the factor b := d( J2i=i \~ 2 ) ( J2i=i K~ ) where A; are the 
eigenvalues of the matrix (5 , n S , ^) 1,/2 S _1//2 . The factor equals one if the matrices 
are proportional to each other, and is larger otherwise. While the factor may not 
have the same interpretation in the present setting involving Student distributions, 
it serves as a good measure of mismatch between S n S^ and E. Figure |3] shows the 
factor b in increasing dimensions each based on 100 runs of the RAM algorithm. The 
convergence of S n S^ — > £ is slower in higher dimensions, but the algorithm seems to 
find a fairly good approximation already with a moderate number of samples. 
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d = 10 d = 20 d = 30 

1.8 
1.6 
1.4 
1.2 



200/c 500/c 800& 200/c 500/c 800k 200/c 500/c 800k 

Figure 3. Suboptimality factor b over one million iterations of the 
RAM algorithm with a different dimensional Student target. 

5.2. Gaussian distribution. The multivariate Gaussian target tt(x) = jV(0, E) 
serves as a baseline comparison for the algorithms, as they should converge to the 
same matrix factoiQ S n S^ — > 6>*£. 

The algorithms were tested in different dimensions, for one thousand covariance 
matrices randomly generated as described in Section loTTl The algorithms were always 
started in 'steady state' so that X\ ~ iV(0, E). The algorithms were run half a 
million iterations: 100,000 burn-in and 400,000 to estimate the proportions of the 
samples X n in the 10%, 25%, 50%, 75% and 90% HPD of the distribution. Table 
[T] shows the overall root mean square error. For dimension two, the results are 
comparable. Surprisingly, when the dimension increases the RAM approach provides 
more accurate results than the AM and the AMS algorithms. 

One possible explanation is that in order to approximate the sample covariance, 
the covariance adaptation in AM and ASWAM should be done using the weight 
sequence r\ n = n -1 as this corresponds almost exactly to the usual sample covariance 
estimator. This setting was tried also; the results appear also in Table [TJ It seems 
that using such a sequence will indeed imply better results, when starting from s\ = I 
or s% = 10~ 4 ■ I. However, when the initial factor s\ = 10 4 • I was 'too large', this 
approach failed. This is probably due to the fact that in this case the eigenvalues of 
the covariance estimate can decay only slowly, at the speed rT 1 . 

Another explanation for the unsatisfactory performance of the AM and ASWAM 
approaches is that in the experiments the adaptation was started right away, not 
after a burn-in phase run with a fixed proposal covariance as suggested in the original 
work [13] . It is expected that the AM and the ASWAM algorithms would perform 
better by a suitable fixed proposal burn-in and perhaps with yet another step size 
sequences. In any case, this experiment demonstrates one strength of the RAM 
adaptation mechanism, namely that it does not require such a burn-in period. 

5.3. Mixture of separate Gaussians. The last example concerns a mixture of two 
Gaussians distributions in M. d with mean vectors mi := [4, 0, ... , 0] T and mi '■— —m\ 
and with a common diagonal covariance matrix S := diag(l, 100, . . . , 100). In such a 
case, the mixing will be especially problematic with respect to the first coordinate. 

1 For the AM algorithm, the constant 6* is slightly different, but approximately equal in higher 
dimensions. 




16 



MATTI VIHOLA 



Table 1. Errors in Gaussian quantiles in different dimensions. The 
step sizes r] n = n -1 were used for covariance estimation for AM* and 
ASWAM*. 



SI EE I Si EE 10~ 4 I Si EE 10 4 • / 

d 2 4 8 16 32 2 4 8 16 32 2 4 8 16 32 

AM 0.21 0.33 1.25 6.83 33.87 0.20 0.33 1.26 6.79 35.73 0.21 0.33 1.24 6.83 32.49 

ASWAM 0.22 0.32 1.23 6.67 33.78 0.21 0.34 1.25 6.67 35.77 0.21 0.33 1.23 6.63 32.11 

AM 1 0.21 0.27 0.41 0.70 1.70 0.20 0.28 0.39 0.55 2.90 6.22 27.54 53.21 57.69 58.20 

ASWAM 1 " 0.22 0.36 0.37 0.53 1.05 0.22 0.28 0.37 0.53 3.03 0.88 1.94 3.17 5.34 8.48 

RAM 0.21 0.27 0.37 0.52 1.03 0.22 0.27 0.38 0.62 2.51 0.22 0.28 0.45 0.75 1.61 



Table 2. Errors of the expectations of the first and the other coordi- 
nates in the mixture example. 

















x m 


X (d) 




d 


2 


4 


8 


16 


32 


2 


4 


8 16 


32 


AM 

ASWAM 
RAM 


0.04 
0.04 
0.07 


0.05 
0.06 
0.21 


0.08 
0.10 
0.66 


1.69 
1.82 
1.34 


3.87 
3.86 
1.77 


0.08 
0.08 
0.05 


0.11 
0.11 
0.08 


0.15 0.19 
0.14 0.18 
0.11 0.16 


0.27 
0.27 
0.29 



Table |2] shows the root mean square error of the expectation of the first coordinate 
and the overall error for the rest X^ 2 \...,X^. The errors in the first coordinate 
for the RAM are significantly higher than for the AM and the ASWAM for dimensions 
2, 4 and 8. The estimates from all the algorithms are already quite unreliable in 
dimension 16. For the latter coordinates, the RAM approach seems to provide better 
estimates. Observe also that when comparing ASWAM with AM, the results are also 
worse in the first coordinate and better in the rest, like in the RAM approach. This 
indicates that the true optimal acceptance rate is here probably slightly less than the 
enforced 23.4%. 

The example shows how the RAM approach finds the 'local shape' of the distri- 
bution. In fact, it is quite easy to see what happens if the means of the mixture 
components would be made further and further apart: there would be a stable point 
of the RAM algorithm that would approach the common covariance of the mixture 
components. Such a behaviour of the RAM approach is certainly a weakness in cer- 
tain settings, as this example, but it can be also advantageous. Notice also that 
even such a simple multimodal setting poses a challenge for the random walk based 
approaches. 

6. Discussion 

A new robust adaptive Metropolis (RAM) algorithm was presented. The algorithm 
attains a given acceptance probability, and at the same time finds an estimate of 
the shape of the target distribution. The algorithm can cope with targets having 
arbitrarily heavy tails unlike the AM and ASWAM algorithms based on the covariance 
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estimate. The RAM algorithm has some obvious limitations. It is not suitable 
for strongly multi-modal targets, but this is the case for any random walk based 
approach. For sufficiently regular targets, it seems to work well and the experiments 
indicate that RAM is competitive with the AM and ASWAM algorithms also in case 
of light-tailed targets having second moments. 

There are several interesting directions of further research that were not covered 
in the present work. The RAM algorithm can be used also within Gibbs sampling, 
that is, when updating a block of coordinate variables at a time instead of the whole 
vector. This approach is often very useful especially when the target distribution rr 
consists of a product of conditional densities, which is often the case with Bayesian 
hierarchical models. In such a setting, the computational cost of evaluating the 
ratio Tc(y)/Tr(x) after updating one coordinate block can be significantly less than the 
full evaluation of n(y). It would also be worth investigating the effect of different 
adaptation step sizes, perhaps even adaptive ones as suggested by j3|. 

Regarding theoretical questions, the existence and uniqueness of the fixed points of 
the approach could be verified in a more general setting; the present work only covers 
elliptically symmetric and product type target densities, which are too restrictive in 
practice. The experiments indicate the overall stability of the RAM algorithm; see 
also Remark [TDJ However, proving the stability of RAM without prior bounds is di- 
rectly related to the more general open question on the stability of adaptive MCMC 
algorithms, or even more generally to the stability of stochastic approximation. Hav- 
ing the stability and more general conditions on the fixed points, one could also prove 
the convergence of S n in a more general setting. 

Acknowledgements 

The author was supported by the Finnish Centre of Excellence in Analysis and 
Dynamics Research and by the Finnish Graduate School in Stochastics and Statistics. 
The author thanks Professor Christophe Andrieu and Professor Eric Moulines for 
useful discussions on the behaviour of the algorithm. 

References 

[1] C. Andrieu and E. Moulines. On the ergodicity properties of some adaptive 
MCMC algorithms. Ann. Appl. Probab., 16(3):1462-1505, 2006. 

[2] C. Andrieu and C. P. Robert. Controlled MCMC for optimal sampling. Technical 
Report Ceremade 0125, Universite Paris Dauphine, 2001. 

[3] C. Andrieu and J. Thorns. A tutorial on adaptive MCMC. Statist. Comput., 18 
(4):343-373, Dec. 2008. 

[4] C. Andrieu, E. Moulines, and S. Volkov. Convergence of stochastic approxima- 
tion for Lyapunov stable dynamics: a proof from first principles, 2004. preprint. 

[5] C. Andrieu, E. Moulines, and P. Priouret. Stability of stochastic approximation 
under verifiable conditions. SI AM J. Control Optim., 44(1):283-312, 2005. 

[6] Y. Atchade and G. Fort. Limit theorems for some adaptive MCMC algorithms 
with subgeometric kernels. Bernoulli, 16(1):116-154, Feb. 2010. 



18 MATTI VIHOLA 

[7] Y. F. Atchade and J. S. Rosenthal. On adaptive Markov chain Monte Carlo 
algorithms. Bernoulli, ll(5):815-828, 2005. 

[8] A. Benveniste, M. Metivier, and P. Priouret. Adaptive Algorithms and Stochastic 
Approximations. Number 22 in Applications of Mathematics. Springer- Verlag, 
1990. ISBN 0-387-52894-6. 

[9] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cam- 
bridge University Press, 2008. ISBN 978-0-521-51592-4. 
[10] J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart. UNPACK Users 1 
Guide. Society for Industrial and Applied Mathematics, 1979. ISBN 0-89871- 
172-X. 

[11] A. Gelman, G. O. Roberts, and W. R. Gilks. Efficient Metropolis jumping rules. 

In Bayesian Statistics 5, pages 599-607. Oxford University Press, 1996. 
[12] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo 

in Practice. Chapman & Hall/CRC, Boca Raton, Florida, 1998. ISBN 0-412- 

05551-1. 

[13] H. Haario, E. Saksman, and J. Tamminen. An adaptive Metropolis algorithm. 
Bernoulli, 7(2):223-242, 2001. 

[14] D. Hastie. Toward Automatic Reversible Jump Markov Chain Monte Carlo. PhD 
thesis, University of Bristol, Mar. 2005. 

[15] P. J. Huber. Robust Statistics. Wiley series in probability and mathematical 
statistics. Wiley, 1981. ISBN 0-471-41805-6. 

[16] S. F. Jarner and E. Hansen. Geometric ergodicity of Metropolis algorithms. 
Stochastic Process. Appl, 85:341-361, 2000. 

[17] S. F. Jarner and G. O. Roberts. Convergence of heavy-tailed Monte Carlo Markov 
chain algorithms. Scand. J. Stat., 34(4):781-815, Dec. 2007. 

[18] H. J. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algo- 
rithms and Applications. Number 35 in Applications of Mathematics: Stochastic 
Modelling and Applied Probability. Springer- Verlag, second edition, 2003. ISBN 
0-387-00894-2. 

[19] E. Nummelin. MC's for MCMC'ists. International Statistical Review, 70(2): 
215-240, 2002. 

[20] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer- Verlag, 
New York, 1999. ISBN 0-387-98707-X. 

[21] G. O. Roberts and J. S. Rosenthal. Optimal scaling for various Metropolis- 
Hastings algorithms. Statist. Sci., 16(4):351-367, 2001. 

[22] G. O. Roberts and J. S. Rosenthal. General state space Markov chains and 
MCMC algorithms. Probability Surveys, 1:20-71, 2004. 

[23] G. O. Roberts and J. S. Rosenthal. Coupling and ergodicity of adaptive Markov 
chain Monte Carlo algorithms. J. Appl. Probab., 44(2):458-475, 2007. 

[24] G. O. Roberts and J. S. Rosenthal. Examples of adaptive MCMC. J. Comput. 
Graph. Statist, 18(2):349-367, 2009. 

[25] G. O. Roberts, A. Gelman, and W. R. Gilks. Weak convergence and optimal 
scaling of random walk Metropolis algorithms. Ann. Appl. Probab., 7(1):110-120, 



ROBUST ADAPTIVE METROPOLIS 



19 



1997. 

[26] E. Saksman and M. Vihola. On the ergodicity of the adaptive Metropolis algo- 
rithm on unbounded domains. Ann. Appl. Probab., 20(6):2178-2203, Dec. 2010. 

[27] M. Vihola. Grapham: Graphical models with adaptive random walk Metropolis 
algorithms. Comput. Statist. Data Anal, 54(l):49-54, Jan. 2010. 

[28] M. Vihola. On the stability and ergodicity of adaptive scaling Metropolis algo- 
rithms. Preprint. larXiv:0903.406lV 3. Apr. 2011. 

[29] M. Vihola. Can the adaptive Metropolis algorithm collapse without the covari- 
ance lower bound? Electron. J. Probab., 16:45-75, 2011. 

Appendix A. Regularity of directional mean acceptance probability 

Proposition 20. Let n and q be probability densities on M. d and on (0,oo) ; respec- 
tively, and let v G M. d be a unit vector. The function g : (0, oo) — > (0, oo) defined 
by 

poo I* 

9(@) := / / niin {7r(x), 7r(x + r6*f )} dxg(r)dr 



is continuous, \im.Q^ 00 g{9) = and lim e ^ 0+ g{9) = 1. 

Proof. Suppose first that tc is a continuous probability density on ~R d . Then, write 

g(u) = / mm < 1, — — > iT(x)dxq{r)dr 

Jo J A I n { x ) J 

where A := {x G M. d : ir(x) > 0} stands for the support of it. Let {6 n ) n >\ C 

(0, oo) be any sequence and define fe(x,r) := min { 1 , ^ X ^Z\ } ■ Clearly, whenever 

n converges to some 9, then fe n (x,r) — > fg(x,r) pointwise on A x (0, oo) by the 

continuity of 7r. Since \f n (x,r)\ < 1, the dominated convergence theorem yields that 

\g(9n) — g(6)\ ~^ 0, establishing the continuity. For any sequence 9 n — > 0+ one 

clearly has fe n (x,r) —> 1, and for any sequence 6 n — > oo one obtains fe n (x,r) — > 0, 

establishing the claim. 

Let us then proceed to the general case. Let e > be arbitrary. We shall show 

that there exists a continuous probability density tt on W 1 such that 

\fr(x) — 7r(a;)|da; < e. 



< / \n(x) — n(x)\dx < e 



Having such tt, one can bound the difference 

g[p) — / mm < 1, — > 7r{x)axq{r)ar 

Jo J A I J 

establishing the claim. 

Let us finally verify that such a continuous probability density tc exists. Approxi- 
mate ii first by smooth non-negative functions n n such that f Rd \tt(x) — n n (x)\dx — > 0, 
and then normalise them to probability densities 7t n (x) '■= c n 7t n (x). Clearly, the con- 
stants c n := (J Rd TT n (z)dz)~ 1 -)■ 1, and so f Rd \ it (x) - n n (x) \ dx < j Rd |7r(x)-7r„(x)|dx+ 
ll-cl-^O. □ 
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Appendix B. Proofs of convergence 



Theorem 1T_, Let 0<a<6<oobe arbitrary constants and denote by E> a ,b C 
M. dxd the set of all lower triangular matrices with positive diagonal, such that the 
eigenvalues of ss T are within [a, b]. Let P s stand for the random walk Metropolis 
kernel with a proposal density q s (z) := det(s)~ 1 q(s~ 1 z), that is, for any x G M d and 
any Borel set A C R d 

P s {x,A) := l A (x) (1- [ min\l,^\\q s (y-x)dy 



+ f min (l, ^i) Qs(y ~ x)dy. 



We shall use the notation P s f(x) ■= L d f(y)P s (x, dy) to denote the integration of a 
function with respect to the kernel P s . 

Let us check that the following assumptions are satisfied. 

(Al) For all possible s G S a) 6, the kernels P s have a unique invariant probability 
distribution it for which j Rd P(x, A)n(dx) = ir(A) for any Borel set A C M. d . 

(A2) There exist a Borel set C C R d , a function V : R d ->■ [1, oo), constants 5, A G 
(0, 1) and 6 < oo, and a probability measure z/ concentrated on C such that 

P s V(x) < \V(x) + l c (a;)6 and 
P s (x,A) > l c (x)6v(A) 

for all possible x G M d , s G S ai b and all Borel sets A C IK 6 '. 
(A3) For all n > 1 and any r G (0, 1], there is a constant d = d(r) > 1 such that for 
all s, s' G S a ,6, 

\P s f(x)-P s ,f(x)\ \f(x)\ 

sup , . < c |s — s I sup . 

xeR d V\x) xgR d l/ r (x i ) 

(A4) There is a constant c < oo such that for all n > 1, s G § a ,&, i6l d and «eK d 
the bound \H(s,x,u)\ < c holds. 

The uniqueness of the invariant distribution (AH]) follows by observing that the kernels 
P s are irreducible, aperiodic and reversible with respect to 7r [see, e.g. 0. The 



simultaneous drift and minorisation condition (AT5]) and the continuity condition was 
established by [l| . The continuity condition (A[3]) was established by [l| for Gaussian 
proposal distributions and was extended to cover the Student proposal in [28|. The 
bound (ATJJ) is easy to verify. 

Assumption [TH] ensures that for any e > there exist constants < a t < b e < oo 
such that all the eigenvalues of S n S^ stay within the interval [a e , b e ] at least with 
probability P(Oq) — e - This is enough to ensure that the strong law of large numbers 



holds by Proposition 6]. For details, see also [26|, Theorem 2] and [28|, Theorem 
20]. ' ~ □ 

Theorem [73. The proof follows by 0, Theorem 5] by using a similar technique as in 
the proof of Theorem [TT] Consider the Lyapunov function w 1T {R) defined in Theorem 
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ITTl It is straightforward to verify items 1-4 of [4j, Condition 1] when we take to 
be the space of symmetric positive definite matrices and consider S n S^ G G. The 
compact sets are of the form K = {ss T : s e §a e ,f> £ } with a t ,b 6 as in the proof 
of Theorem [T71 Item 5 follows by invoking (26l . Proposition 6] with fg((x, «)) = 
H(Q,x,u). □ 
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